2024 Plant Biology and Data Science (PURE-PD)
(Pictured: Lizzie, Dialo, and Jeremy.)
I spent the summer of 2024 in a small town in Indiana killing plants. I worked at Purdue University (Go Boilermakers!) in the plant pathology department. There, I had the pleasure of working in Dr. Anjali Iyer-Pascuzzi’s lab with my mentor, Abbie. (Shout out to Abbie for being the best mentor ever.) The lab specializes in characterizing and analyzing Ralstonia solanacearum, a pathogenic bacterium that damages potatoes, tomatoes, and geraniums. I learned how to do colonization assays, plant infiltrations, and ROS assays to determine how certain effectors within Ralstonia work. Effectors are proteins that are injected into a host organism to create more favorable conditions for colonization. (Effectors come in all shapes and sizes, but my project mostly investigates Type III ones.) My interest in effectors did not end when I returned to Utah.
The best part about Ralstonia is that, in certain media, it has a Barbie Pink color:
With all that sentimental stuff done, here is the project:
There are A LOT of different effectors within Ralstonia solanacearum. This led me to wonder how they could be connected and how they have evolved in different strains of this bacterium.
After many hours of looking for a good data set, I thought I found one. I was wrong. I found a useful table from the Pathogen-Host Interactions database (PHI-base). This database collects effectors from many pathogens. While this is a cool data set, I soon realized that many of the genes are mislabeled, there is A TON of missing data for Ralstonia, and I was not going to get very far with it. (I probably should have stopped after some questionable cleaning.) This is where I realized I would need a new data set.
That data set was not completely useless, so here are some interesting connections from the Ralstonia information in PHI-base.
## Rows: 58 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ProteinID, GeneLocusID, Gene_name, Pathogen_species, Disease_name, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
As you can see, this information was not entirely helpful to my project: there weren’t many strains of Ralstonia solanacearum recorded, and the effectors were mislabeled and often duplicated. (Ok, I’ll stop complaining.)
I started working with another data set that contains many Rip (Ralstonia injected protein) types and strain variants. (The new data set comes from this really cool paper: “Repertoire, unified nomenclature and evolution of the Type III effector gene set in the Ralstonia solanacearum species complex” by Nemo Peeters et al.) Here is a table to look through them. (Feel free to click through all 63 pages.)
## Warning: package 'reactable' was built under R version 4.4.3
## Rows: 622 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): #Species, accession, type, name, description
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
After getting this data, I had to pull the protein sequences from NCBI’s GenBank. The only problem: there are 622 sequences. No matter how much I believe in my laptop, it won’t survive that request.
So, I had to do it in batches. (Shoutout to many internet searches on how to get this to work, and to generative AI.)
library(rentrez)
library(seqinr)

# accession numbers only
ids <- ral_df$accession

# split into batches of 20 (all my computer can handle)
batch_size <- 20
id_batches <- split(ids, ceiling(seq_along(ids) / batch_size))

all_seqs <- list()
for (i in seq_along(id_batches)) {
  cat("Downloading batch", i, "of", length(id_batches), "\n")
  # fetch this batch as FASTA text from NCBI's protein database
  fasta <- entrez_fetch(db = "protein", id = id_batches[[i]],
                        rettype = "fasta", retmode = "text")
  # write the text to a temporary file so read.fasta() can split it
  # into individual sequences
  temp <- tempfile(fileext = ".fasta")
  writeLines(fasta, temp)
  seqs <- read.fasta(temp, seqtype = "AA", as.string = TRUE)
  all_seqs <- c(all_seqs, seqs)
  unlink(temp)
}

# save everything to one file (the renaming step reads this file)
write.fasta(all_seqs, names = names(all_seqs),
            file.out = "ralproteins.fasta", as.string = TRUE)
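To see what the batching line does without touching NCBI, here is a minimal sketch using made-up placeholder IDs (the `ACC...` strings are hypothetical, not real accession numbers). With 622 IDs and a batch size of 20, `split()` should give 32 batches, with the last one holding the 2 leftovers.

```r
# 622 placeholder IDs standing in for the real accession numbers (hypothetical)
ids <- sprintf("ACC%03d", 1:622)

batch_size <- 20
# ceiling(position / batch_size) gives each ID a batch number: 1, 1, ..., 2, 2, ...
id_batches <- split(ids, ceiling(seq_along(ids) / batch_size))

length(id_batches)           # 32 batches in total
lengths(id_batches)[[32]]    # the last batch holds only 2 IDs
```

A smaller `batch_size` means more requests but less memory per request, so it is the one knob to turn if a laptop struggles.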
These FASTA sequences were saved. Next, their names had to be changed. (A cruel punishment, making me rename everything.) For those out there who must suffer as I did, here is the code for that:
# renaming so the sequences are not labeled by accession number
library(seqinr)
# read with seqinr so the object works with write.fasta() below;
# the sequence names are the accession numbers
sequences <- read.fasta("ralproteins.fasta", seqtype = "AA", as.string = TRUE)
# look up each accession in ral_df, keeping the FASTA order
# (merge() sorts its result by accession, which can mismatch names and sequences)
idx <- match(names(sequences), ral_df$accession)
fullname <- paste(ral_df$name[idx], ral_df$`#Species`[idx])
# keep the accession for any sequence that is not in ral_df
fullname[is.na(idx)] <- names(sequences)[is.na(idx)]
# renamed sequences
names(sequences) <- fullname
# save the renamed sequences
write.fasta(sequences, names = names(sequences),
            file.out = "ral_fullname.fasta", as.string = TRUE)
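One thing worth checking when renaming from a lookup table is row order. `merge()` sorts its result by the key column by default, so the rows may no longer line up with the order of the sequences in the FASTA file, while `match()` keeps that order. Here is a small sketch with an invented three-row table (every accession, name, and strain label is a placeholder, not real data):

```r
# toy lookup table standing in for ral_df (all values are invented)
ral_toy <- data.frame(
  accession = c("ACC001", "ACC002", "ACC003"),
  name      = c("ripA1", "ripB", "ripU"),
  species   = c("strainX", "strainY", "strainZ")
)
# pretend this is the order of the sequence names in the FASTA file
fasta_names <- c("ACC003", "ACC001", "ACC002")

# merge() returns rows sorted by accession, not in FASTA order
merged <- merge(data.frame(accession = fasta_names), ral_toy,
                by = "accession", all.x = TRUE)
as.character(merged$accession)   # "ACC001" "ACC002" "ACC003"

# match() looks up each FASTA name in place, so the order is preserved
idx <- match(fasta_names, ral_toy$accession)
new_names <- paste(ral_toy$name[idx], ral_toy$species[idx])
new_names   # "ripU strainZ" "ripA1 strainX" "ripB strainY"
```

If the FASTA file already happens to be sorted by accession, both approaches give the same labels, so a quick comparison like this is a cheap sanity check.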
At last, we can move on. (Most of these were NOT problems I thought about. Consider this in the future.)
Next, the sequences have to be aligned. There are a couple of different options, but after doing some research I decided to use the msa package, as it is efficient and maybe won’t melt my computer.
I will need to break up the Rip proteins quite a bit to get my computer to run anything. (I am writing this after many hours of troubleshooting due to file size.)
Here is a small preview of an alignment for the RipU proteins from different species. (Labels for the sequences are missing because they simply did not fit, my bad.)
To start the alignment, each protein needs to be compared across the different strains of Ralstonia. I used the msa package with method = “Muscle”. I picked it not only because it has a cool name, but also because it is reported to give accurate alignments even when the sequences are short. (As is the case with most effectors.)
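For anyone who wants to try the msa step, here is a minimal sketch. It assumes the Bioconductor package msa is installed, and the three short sequences are invented for illustration (they are not real RipU sequences). The real run would read the RipU subset from the renamed FASTA file instead.

```r
library(msa)  # Bioconductor package; attaches Biostrings as a dependency

# three made-up protein fragments (placeholders, not real effector sequences)
toy <- AAStringSet(c(
  seq1 = "MKTAYIAKQRQISFVKSHFSRQ",
  seq2 = "MKTAYIAKQRISFVKSHFSRQ",
  seq3 = "MKSAYIAKQRQISFVRSHFSR"
))

# align with MUSCLE; msa() also offers "ClustalW" and "ClustalOmega"
aln <- msa(toy, method = "Muscle")

# print every column of the alignment instead of a truncated view
print(aln, show = "complete")
```

After alignment every row has the same width, with gaps filling in where a sequence is shorter.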
For RipU, here is a small tree showing how this effector is related across different bacterial strains. (RipU is close to my heart since I worked with it so much.)
The code:
library(DECIPHER)  # also attaches Biostrings, which provides readAAStringSet()

# for RipU
sequences <- readAAStringSet("ral_fullname.fasta")
# keep sequences with "ripU" in their name
subset_sequences <- sequences[grep("ripU", names(sequences))]
# align
alignRipU <- AlignSeqs(subset_sequences)
# distance matrix, computed from the alignment rather than the raw sequences
DRipU <- DistanceMatrix(alignRipU,
                        correction = "none",  # uncorrected distances; choose a model here
                        type = "dist",
                        processors = NULL)    # use all CPUs
# tree
tree_ripU <- Treeline(myDistMatrix = DRipU,
                      method = "ME",
                      showPlot = FALSE,
                      processors = NULL)
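To make the distance step less of a black box, here is a rough base-R sketch of an uncorrected distance: the fraction of aligned positions where two sequences differ. This is a simplified illustration, not DECIPHER's implementation (DECIPHER treats gaps with its own rules), and the three aligned strings are invented. Base R has no minimum evolution method, so the sketch swaps in average-linkage clustering (UPGMA) from `hclust()` just to show how a distance matrix turns into a tree.

```r
# three made-up, already-aligned sequences of equal length (placeholders)
aligned <- c(seqA = "MKTAYIAK", seqB = "MKTAYLAK", seqC = "MRTGYLAK")
chars <- strsplit(aligned, "")

# uncorrected distance: proportion of positions that differ
p_distance <- function(a, b) mean(a != b)

n <- length(chars)
d <- matrix(0, n, n, dimnames = list(names(aligned), names(aligned)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- p_distance(chars[[i]], chars[[j]])
  }
}
d   # seqA vs seqB = 0.125, seqA vs seqC = 0.375, seqB vs seqC = 0.25

# UPGMA stand-in for the ME tree: the closest pair (seqA, seqB) joins first
tree <- hclust(as.dist(d), method = "average")
plot(tree)
```

The important part is that the distances only mean something when position 1 in one sequence corresponds to position 1 in the others, which is what the alignment provides.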
A pretty plot:
Now let’s make this better. Here is another chart with all of the “ripA” effectors. These effectors serve multiple purposes: suppressing transcription factors, altering pathways, and destroying the structure of microtubules. (This uses the same process as above so I did not include the code.)
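library(DECIPHER)  # also attaches Biostrings, which provides readAAStringSet()
library(ggtree)

# for all effectors
sequences_all <- readAAStringSet("ral_fullnamereal.fasta")
# keep sequences with "ripA" in their name
subset_sequencesripA <- sequences_all[grep("ripA", names(sequences_all))]
# aligning
alignRipA <- AlignSeqs(subset_sequencesripA)
# distance matrix, computed from the alignment rather than the raw sequences
allripA <- DistanceMatrix(alignRipA,
                          correction = "none",  # uncorrected distances; choose a model here
                          type = "dist",
                          processors = NULL)    # use all CPUs
# tree
tree_rips <- Treeline(myDistMatrix = allripA,
                      method = "ME",
                      showPlot = FALSE,
                      processors = NULL)
# tree plot with tip labels
interactivetreeA <- ggtree(tree_rips) + geom_tiplab()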
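A side note on the `grep("ripA", ...)` subsetting used for this tree: the pattern matches any name containing "ripA", so it also picks up two-letter families such as ripAA or ripAB. That is handy for a broad chart, but a stricter pattern is needed to keep only the numbered ripA members. A small sketch with illustrative labels (the strain parts are placeholders):

```r
# illustrative sequence labels in "name strain" form (strains are placeholders)
seq_names <- c("ripA1 strainX", "ripAA strainX", "ripAB strainY",
               "ripU strainX", "ripB strainY")

# broad match: everything containing "ripA"
broad <- grep("ripA", seq_names, value = TRUE)
broad    # "ripA1 strainX" "ripAA strainX" "ripAB strainY"

# strict match: "ripA" at the start, followed directly by a digit
strict <- grep("^ripA[0-9]", seq_names, value = TRUE)
strict   # "ripA1 strainX"
```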
Ralstonia solanacearum is found on all continents. As crops move, this bacterium has the potential to spread and diversify. Some strains are even becoming cold resistant, which lets them survive in new climates. Building a phylogenetic tree lets us investigate where strains evolved from and how their further spread might be prevented.
I decided not to melt my computer trying to make a really big tree, even though it would’ve been really cool.